Segmentation-Free Word Embedding for Unsegmented Languages
Abstract
In this paper, we propose a new word embedding pipeline for unsegmented languages, called segmentation-free word embedding, which does not require word segmentation as a preprocessing step. Unlike space-delimited languages, unsegmented languages such as Chinese and Japanese conventionally require word segmentation before word vectors can be learned. However, word segmentation, which often requires manually annotated resources, is difficult and expensive, and unavoidable segmentation errors propagate to downstream tasks. To avoid these problems when learning word vectors for unsegmented languages, we collect word co-occurrence statistics over all possible segmentation candidates built from frequent character n-grams, instead of relying on the segmented sentences produced by conventional word segmenters. Our experiments on noun category prediction tasks using raw Twitter, Weibo, and Wikipedia corpora show that the proposed method outperforms conventional approaches that require word segmenters.
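The idea of counting co-occurrences over every segmentation candidate, rather than over one segmenter's output, can be illustrated with a minimal Python sketch. The abstract does not give the model details, so everything here is an assumption made for illustration: the function names `frequent_ngrams` and `cooccurrence_counts`, the frequency threshold `min_count`, the maximum n-gram length `max_n`, and the choice to treat only directly adjacent n-grams as co-occurring. It is not the authors' implementation.

```python
from collections import Counter


def frequent_ngrams(corpus, max_n=3, min_count=2):
    """Collect character n-grams (lengths 1..max_n) seen at least min_count times.

    Hypothetical helper: the threshold and length limit are illustrative
    assumptions, not values taken from the paper.
    """
    counts = Counter()
    for line in corpus:
        for n in range(1, max_n + 1):
            for i in range(len(line) - n + 1):
                counts[line[i:i + n]] += 1
    return {gram for gram, c in counts.items() if c >= min_count}


def cooccurrence_counts(corpus, vocab, max_n=3):
    """Count (left, right) pairs of frequent n-grams that are directly adjacent.

    Every position and every n-gram length is tried, so all candidate
    segment boundaries contribute, with no word segmenter involved.
    Using adjacency as the context window is an assumption of this sketch.
    """
    pairs = Counter()
    for line in corpus:
        for i in range(len(line)):
            for n in range(1, max_n + 1):
                left = line[i:i + n]
                if len(left) < n or left not in vocab:
                    continue
                j = i + n  # the right-hand candidate starts where the left one ends
                for m in range(1, max_n + 1):
                    right = line[j:j + m]
                    if len(right) == m and right in vocab:
                        pairs[(left, right)] += 1
    return pairs
```

Counts like these could then be fed to any co-occurrence-based embedding objective. Which objective the paper uses is not stated in the abstract, so the sketch stops at the counting step.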
Similar resources
Boosting N-gram Coverage for Unsegmented Languages Using Multiple Text Segmentation Approach
Automatic word segmentation errors, for languages having a writing system without word boundaries, negatively affect the performance of language models. As a solution, the use of multiple, instead of unique, segmentation has recently been proposed. This approach boosts N-gram counts and generates new N-grams. However, it also produces bad N-grams that affect the language models' performance. In...
Segmentation-Based And Segmentation-Free Methods for Spotting Handwritten Arabic Words
Given a set of handwritten documents, a common goal is to search for a relevant subset. Attempting to find a query word or image in such a set of documents is called word spotting. Spotting handwritten words in documents written in the Latin alphabet, and more recently in Arabic, has received considerable attention. One issue is generating candidate word regions on a page. Attempting to definit...
Language Model Based Arabic Word Segmentation
We approximate Arabic’s rich morphology by a model in which a word consists of a sequence of morphemes in the pattern prefix*-stem-suffix* (* denotes zero or more occurrences of a morpheme). Our method is seeded by a small manually segmented Arabic corpus and uses it to bootstrap an unsupervised algorithm to build the Arabic word segmenter from a large unsegmented Arabic corpus. The algorithm uses ...
Substring Frequency Features for Segmentation of Japanese Katakana Words with Unlabeled Corpora
Word segmentation is crucial in natural language processing tasks for unsegmented languages. In Japanese, many out-of-vocabulary words appear in the phonetic syllabary katakana, making segmentation more difficult due to the lack of clues found in mixed script settings. In this paper, we propose a straightforward approach based on a variant of tf-idf and apply it to the problem of word segmentati...
BLEU in Characters: Towards Automatic MT Evaluation in Languages without Word Delimiters
Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU or NIST, are now well established. Yet, they are scarcely used for the assessment of language pairs like English-Chinese or English-Japanese, because of the word segmentation problem. This study establishes the equivalence between the standard use of BLEU in word n-grams and its application at the character level. T...
Publication date: 2017